AITopics | computer architecture

Collaborating Authors

computer architecture

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting

Yu, Zhongkai, Guan, Yue, Yu, Zihao, Zhou, Chenyang, Hu, Zhengding, Pei, Shuyi, Kang, Yangwook, Ding, Yufei, Tsai, Po-An

arXiv.org Artificial IntelligenceDec-8-2025

Large-scale Mixture of Experts (MoE) Large Language Models (LLMs) have recently become the frontier open weight models, achieving remarkable model capability similar to proprietary ones. But their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across four state-of-the-art large-scale MoE models released in 2025 (200B-1000B) using over 24,000 requests spanning diverse workloads. We perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. With our insights, we then demonstrate how to improve wafer-scale GPUs as a case study, and show that minor architectural modifications leveraging the insights achieve substantial performance gains, delivering 5.3x and 3.1x average speedups on DeepSeek V3 and Qwen3, respectively. Our work presents the first comprehensive data-centric analysis of large-scale MoE models and a concrete design study using the learned lessons, with profiling traces and simulation framework already open-sourced with $>$1k downloads. Our traces and results are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.05497

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Industry: Semiconductors & Electronics (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)

Add feedback

CAMformer: Associative Memory is All You Need

Molom-Ochir, Tergel, Morris, Benjamin F., Horton, Mark, Wei, Chiyue, Guo, Cong, Taylor, Brady, Liu, Peter, Wang, Shan X., Fan, Deliang, Li, Hai Helen, Chen, Yiran

arXiv.org Artificial IntelligenceNov-26-2025

Transformers face scalability challenges due to the quadratic cost of attention, which involves dense similarity computations between queries and keys. We propose CAMformer, a novel accelerator that reinterprets attention as an associative memory operation and computes attention scores using a voltage-domain Binary Attention Content Addressable Memory (BA-CAM). This enables constant-time similarity search through analog charge sharing, replacing digital arithmetic with physical similarity sensing. CAMformer integrates hierarchical two-stage top-k filtering, pipelined execution, and high-precision contextualization to achieve both algorithmic accuracy and architectural efficiency. Evaluated on BERT and Vision Transformer workloads, CAMformer achieves over 10x energy efficiency, up to 4x higher throughput, and 6-8x lower area compared to state-of-the-art accelerators--while maintaining near-lossless accuracy.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.1974

Country: North America > United States (0.46)

Genre:

Research Report (0.50)
Overview (0.46)

Industry: Energy (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.71)
Information Technology > Artificial Intelligence > Systems & Languages > Programming Languages (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats

Chen, Yuzong, Fang, Chao, Dai, Xilai, Wu, Yuheng, Tambe, Thierry, Verhelst, Marian, Abdelfattah, Mohamed S.

arXiv.org Artificial IntelligenceNov-18-2025

The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing units (NPUs) with DRAM-based processing-in-memory (PIM) for LLM acceleration. However, existing high-precision (e.g., FP16) PIM compute units incur significant area and power overhead in DRAM technology, limiting the effective computation throughput. In this paper, we introduce P3-LLM, a novel NPU-PIM integrated accelerator for LLM inference using hybrid numerical formats. Our approach is threefold: First, we propose a flexible mixed-precision quantization scheme, which leverages hybrid numerical formats to quantize different LLM operands with high compression efficiency and minimal accuracy loss. Second, we architect an efficient PIM accelerator for P3-LLM, featuring enhanced compute units to support hybrid numerical formats. Our careful choice of numerical formats allows to co-design low-precision PIM compute units that significantly boost the computation throughput under iso-area constraints. Third, we optimize the low-precision dataflow of different LLM modules by applying operator fusion to minimize the overhead of runtime dequantization. Evaluation on a diverse set of representative LLMs and tasks demonstrates that P3-LLM achieves state-of-the-art accuracy in terms of both KV-cache quantization and weight-activation quantization. Combining the proposed quantization scheme with PIM architecture co-design, P3-LLM yields an average of $4.9\times$, $2.0\times$, and $3.4\times$ speedups over the state-of-the-art LLM accelerators HBM-PIM, Ecco, and Pimba, respectively. Our quantization code is available at https://github.com/yc2367/P3-LLM.git

large language model, machine learning, quantization, (19 more...)

arXiv.org Artificial Intelligence

2511.06838

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Role of Advanced Computer Architectures in Accelerating Artificial Intelligence Workloads

Amin, Shahid, Shah, Syed Pervez Hussnain

arXiv.org Artificial IntelligenceNov-14-2025

The remarkable progress in Artificial Intelligence (AI) is foundation-ally linked to a concurrent revolution in computer architecture. As AI models, particularly Deep Neural Networks (DNNs), have grown in complexity, their massive computational demands have pushed traditional architectures to their limits. This paper provides a structured review of this co-evolution, analyzing the architectural landscape designed to accelerate modern AI workloads. We explore the dominant architectural paradigms Graphics Processing Units (GPUs), Appli-cation-Specific Integrated Circuits (ASICs), and Field-Programmable Gate Ar-rays (FPGAs) by breaking down their design philosophies, key features, and per-formance trade-offs. The core principles essential for performance and energy efficiency, including dataflow optimization, advanced memory hierarchies, spar-sity, and quantization, are analyzed. Furthermore, this paper looks ahead to emerging technologies such as Processing-in-Memory (PIM) and neuromorphic computing, which may redefine future computation. By synthesizing architec-tural principles with quantitative performance data from industry-standard benchmarks, this survey presents a comprehensive picture of the AI accelerator landscape. We conclude that AI and computer architecture are in a symbiotic relationship, where hardware-software co-design is no longer an optimization but a necessity for future progress in computing.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.1001

Genre: Overview (1.00)

Industry:

Information Technology (1.00)
Semiconductors & Electronics (0.67)

Technology:

Information Technology > Architecture (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering

Jamet, Alexandre Valentin, Vavouliotis, Georgios, Jiménez, Daniel A., Alvarez, Lluc, Casas, Marc

arXiv.org Artificial IntelligenceNov-4-2025

To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named First Level Predictor (FLP) and Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on virtual addresses and a novel selective delay component. The novelty of SLP relies on leveraging off-chip prediction to drive L1D prefetch filtering by using physical addresses and the FLP prediction as features. TLP constitutes the first hardware proposal targeting both off-chip prediction and prefetch filtering using a multi-level perceptron hardware approach. TLP only requires 7KB of storage. To demonstrate the benefits of TLP we compare its performance with state-of-the-art approaches using off-chip prediction and prefetch filtering on a wide range of single-core and multi-core workloads. Our experiments show that TLP reduces the average DRAM transactions by 30.7% and 17.7%, as compared to a baseline using state-of-the-art cache prefetchers but no off-chip prediction mechanism, across the single-core and multi-core workloads, respectively, while recent work significantly increases DRAM transactions. As a result, TLP achieves geometric mean performance speedups of 6.2% and 11.8% across single-core and multi-core workloads, respectively. In addition, our evaluation demonstrates that TLP is effective independently of the L1D prefetching logic.

artificial intelligence, machine learning, prediction, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/HPCA57654.2024.00046

2403.15181

Country: North America > United States (0.46)

Genre:

Research Report > Promising Solution (0.48)
Overview > Innovation (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.77)

Add feedback

FaRAccel: FPGA-Accelerated Defense Architecture for Efficient Bit-Flip Attack Resilience in Transformer Models

Nazari, Najmeh, Latibari, Banafsheh Saber, Hosseini, Elahe, Movafagh, Fatemeh, Fang, Chongzhou, Makrani, Hosein Mohammadi, Gubbi, Kevin Immanuel, Mahalanobis, Abhijit, Rafatirad, Setareh, Sayadi, Hossein, Homayoun, Houman

arXiv.org Artificial IntelligenceOct-30-2025

Forget and Rewire (FaR) methodology has demonstrated strong resilience against Bit-Flip Attacks (BFAs) on Transformer-based models by obfuscating critical parameters through dynamic rewiring of linear layers. However, the application of FaR introduces non-negligible performance and memory overheads, primarily due to the runtime modification of activation pathways and the lack of hardware-level optimization. To overcome these limitations, we propose FaRAccel, a novel hardware accelerator architecture implemented on FPGA, specifically designed to offload and optimize FaR operations. FaRAccel integrates reconfigurable logic for dynamic activation rerouting, and lightweight storage of rewiring configurations, enabling low-latency inference with minimal energy overhead. We evaluate FaRAccel across a suite of Transformer models and demonstrate substantial reductions in FaR inference latency and improvement in energy efficiency, while maintaining the robustness gains of the original FaR methodology. To the best of our knowledge, this is the first hardware-accelerated defense against BFAs in Transformers, effectively bridging the gap between algorithmic resilience and efficient deployment on real-world AI platforms.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.24985

Country:

North America > United States > California (0.46)
North America > United States > Arizona (0.28)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture

Prakash, Shvetank, Cheng, Andrew, Tschand, Arya, Mazumder, Mark, Gohil, Varun, Ma, Jeffrey, Yik, Jason, Wan, Zishen, Quaye, Jessica, Alvanaki, Elisavet Lydia, Kumar, Avinash, Mazumdar, Chandrashis, Khare, Tuhin, Ingare, Alexander, Uchendu, Ikechukwu, Ghosal, Radhika, Tyagi, Abhishek, Wang, Chenyu, Garavagno, Andrea Mattia, Gu, Sarah, Guo, Alice, Hur, Grace, Carloni, Luca, Krishna, Tushar, Nayak, Ankita, Yazdanbakhsh, Amir, Reddi, Vijay Janapa

arXiv.org Artificial IntelligenceOct-28-2025

The field of computer architecture, which bridges high-level software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations. To this end, we present QuArch (pronounced 'quark'), the first benchmark designed to facilitate the development and evaluation of LLM knowledge and reasoning capabilities specifically in computer architecture. QuArch provides a comprehensive collection of 2,671 expert-validated question-answer (QA) pairs covering various aspects of computer architecture, including processor design, memory systems, and interconnection networks. Our evaluation reveals that while frontier models possess domain-specific knowledge, they struggle with skills that require higher-order thinking in computer architecture. Frontier model accuracies vary widely (from 34% to 72%) on these advanced questions, highlighting persistent gaps in architectural reasoning across analysis, design, and implementation QAs. By holistically assessing fundamental skills, QuArch provides a foundation for building and measuring LLM capabilities that can accelerate innovation in computing systems. With over 140 contributors from 40 institutions, this benchmark represents a community effort to set the standard for architectural reasoning in LLM evaluation.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.22087

Country:

Asia (0.67)
North America > United States > Texas (0.28)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Education (1.00)
Semiconductors & Electronics (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving

Lee, Jungi, Park, Junyong, Cha, Soohyun, Cho, Jaehoon, Sim, Jaewoong

arXiv.org Artificial IntelligenceOct-17-2025

Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are rather unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance compared to the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.14557

Country: Asia (0.15)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework

Stojkovic, Jovan, Zhang, Chaojie, Goiri, Íñigo, Bianchini, Ricardo

arXiv.org Artificial IntelligenceOct-1-2025

The rapid rise of large language models (LLMs) has been driving an enormous demand for AI inference infrastructure, mainly powered by high-end GPUs. While these accelerators offer immense computational power, they incur high capital and operational costs due to frequent upgrades, dense power consumption, and cooling demands, making total cost of ownership (TCO) for AI datacenters a critical concern for cloud providers. Unfortunately, traditional datacenter lifecycle management (designed for general-purpose workloads) struggles to keep pace with AI's fast-evolving models, rising resource needs, and diverse hardware profiles. In this paper, we rethink the AI datacenter lifecycle scheme across three stages: building, hardware refresh, and operation. We show how design choices in power, cooling, and networking provisioning impact long-term TCO. We also explore refresh strategies aligned with hardware trends. Finally, we use operation software optimizations to reduce cost. While these optimizations at each stage yield benefits, unlocking the full potential requires rethinking the entire lifecycle. Thus, we present a holistic lifecycle management framework that coordinates and co-optimizes decisions across all three stages, accounting for workload dynamics, hardware evolution, and system aging. Our system reduces the TCO by up to 40\% over traditional approaches. Using our framework we provide guidelines on how to manage AI datacenter lifecycle for the future.

datacenter, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2509.26534

Country: North America > United States (0.46)

Genre:

Overview (0.67)
Research Report (0.41)

Industry:

Information Technology > Services (1.00)
Energy > Power Industry (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)

Add feedback

Coordinated Reinforcement Learning Prefetching Architecture for Multicore Systems

Siddiqui, Mohammed Humaid, Guzman, Fernando, Wu, Yufei, Ann, Ruishu

arXiv.org Artificial IntelligenceSep-16-2025

Hardware prefetching is critical to fill the performance gap between CPU speeds and slower memory accesses. With multicore architectures becoming commonplace, traditional prefetchers are severely challenged. Independent core operation creates significant redundancy (up to 20% of prefetch requests are duplicates), causing unnecessary memory bus traffic and wasted bandwidth. Furthermore, cutting-edge prefetchers such as Pythia suffer from about a 10% performance loss when scaling from a single-core to a four-core system. To solve these problems, we propose CRL-Pythia, a coordinated reinforcement learning based prefetcher specifically designed for multicore systems. In this work, CRL-Pythia addresses these issues by enabling cross-core sharing of information and cooperative prefetching decisions, which greatly reduces redundant prefetch requests and improves learning convergence across cores. Our experiments demonstrate that CRL-Pythia outperforms single Pythia configurations in all cases, with approximately 12% IPC (instructions per cycle) improvement for bandwidth-constrained workloads, while imposing moderate hardware overhead. Our sensitivity analyses also verify its robustness and scalability, thereby making CRL-Pythia a practical and efficient solution to contemporary multicore systems.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

2509.10719

Country: North America > Canada (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback